The Assignment

Part 1: Factor management

With the data set of your choice, after ensuring the variable(s) you’re exploring are indeed factors, you are expected to:

  1. Drop factor / levels;
  2. Reorder levels based on knowledge from data.

Elaboration for the gapminder data set

Drop Oceania. Filter the Gapminder data to remove observations associated with the continent of Oceania. Additionally, remove unused factor levels. Provide concrete information on the data before and after removing these rows and Oceania; address the number of rows and the levels of the affected factors.

First let’s load the gapminder dataset and packages:

suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(plotly))

Now let’s see which levels the continent factor has, and how many:

levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
nlevels(gapminder$continent)
## [1] 5

Let’s create a new data frame, gapminder_drop_oceania, that filters out the rows for Oceania:

gapminder_drop_oceania <- gapminder %>% filter(continent != "Oceania")
levels(gapminder_drop_oceania$continent) #even though we filtered it out, it is not dropped yet
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

droplevels() removes factor levels that no longer appear in the data, so we add it to the end of the pipeline:

gapminder_drop_oceania2 <- gapminder %>% filter(continent != "Oceania") %>% droplevels()
levels(gapminder_drop_oceania2$continent) #now you can see we have dropped the Oceania level
## [1] "Africa"   "Americas" "Asia"     "Europe"
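The task also asks for the number of rows before and after the filter. A minimal sketch of that comparison, assuming the gapminder and dplyr packages are installed; the names before, after and comparison are illustrative and do not appear in the analysis above:

```r
library(gapminder)
library(dplyr)

# Full data versus the data with Oceania removed and unused levels dropped
before <- gapminder
after  <- gapminder %>%
  filter(continent != "Oceania") %>%
  droplevels()

# One row per data set: row count and number of levels in each affected factor
comparison <- data.frame(
  dataset          = c("before", "after"),
  rows             = c(nrow(before), nrow(after)),
  continent_levels = c(nlevels(before$continent), nlevels(after$continent)),
  country_levels   = c(nlevels(before$country), nlevels(after$country))
)
comparison
```

Because droplevels() is applied to the whole data frame, the country factor should lose the levels for the Oceania countries as well, not only the continent factor.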

Reorder the levels of country or continent. Use the forcats package to change the order of the factor levels, based on a principled summary of one of the quantitative variables. Consider experimenting with a summary statistic beyond the most basic choice of the median.

Let’s filter the data down to one continent and year, the Americas in 2002:

gap_Americas_2002 <- gapminder %>% 
  filter(year == 2002, continent == "Americas")

Now let’s reorder the countries by descending life expectancy:

gap_Americas_2002 %>% 
    mutate(country = fct_reorder(country, desc(lifeExp))) %>% 
    ggplot(aes(lifeExp, country)) + 
    geom_point(colour = "Red") +
    labs(y = "Country", x = "Life Expectancy")

Haiti had the lowest life expectancy in the Americas in 2002 (58.1 years) and Canada the highest (79.8 years).
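The task suggests trying a summary statistic other than the median. With a single year per country there is only one value to summarize, so the sketch below uses all years and reorders continent by the range of life expectancy instead. It assumes the forcats package is installed; gap_by_spread and the anonymous range function are illustrative names and choices, not part of the analysis above:

```r
library(gapminder)
library(dplyr)
library(forcats)

# Reorder continent levels by the spread (max - min) of lifeExp within each continent
gap_by_spread <- gapminder %>%
  mutate(continent = fct_reorder(continent, lifeExp,
                                 .fun = function(x) max(x) - min(x)))

# Levels now run from the smallest spread to the largest
levels(gap_by_spread$continent)
```

Passing .desc = TRUE to fct_reorder() would reverse the order without wrapping the variable in desc().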

Part 2: File I/O

Experiment with one or more of write_csv()/read_csv() (and/or TSV friends), saveRDS()/readRDS(), dput()/dget(). Create something new, probably by filtering or grouped-summarization of Singer or Gapminder. I highly recommend you fiddle with the factor levels, i.e. make them non-alphabetical (see previous section). Explore whether this survives the round trip of writing to file then reading back in.

We will now write the file to the working directory:

write_csv(gap_Americas_2002, "Americas_2002.csv", col_names = TRUE)

We can also read it back using the read_csv function:

read_back <- read_csv("Americas_2002.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   continent = col_character(),
##   year = col_integer(),
##   lifeExp = col_double(),
##   pop = col_integer(),
##   gdpPercap = col_double()
## )
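The column specification above reports country and continent as character, which suggests that factors do not survive a CSV round trip. A minimal sketch comparing CSV with RDS on a reordered factor, assuming readr and forcats are installed; the temporary file paths and the names reordered, from_csv, from_rds and restored are illustrative:

```r
library(gapminder)
library(dplyr)
library(forcats)
library(readr)

# Americas in 2002, with country levels ordered by life expectancy (non-alphabetical)
reordered <- gapminder %>%
  filter(year == 2002, continent == "Americas") %>%
  droplevels() %>%
  mutate(country = fct_reorder(country, lifeExp))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(reordered, csv_path)
saveRDS(reordered, rds_path)

from_csv <- read_csv(csv_path, col_types = cols())  # cols() lets readr guess the types quietly
from_rds <- readRDS(rds_path)

class(from_csv$country)                                         # plain character after CSV
identical(levels(from_rds$country), levels(reordered$country))  # level order kept by RDS

# The order can be restored after a CSV read if the levels are saved separately
restored <- factor(from_csv$country, levels = levels(reordered$country))
```

CSV is plain text, so it stores only the values; RDS serializes the R object itself, including the factor levels and their order.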

To confirm that it worked, let’s check:

knitr::kable(read_back)
country              continent  year  lifeExp        pop  gdpPercap
Argentina            Americas   2002   74.340   38331121   8797.641
Bolivia              Americas   2002   63.883    8445134   3413.263
Brazil               Americas   2002   71.006  179914212   8131.213
Canada               Americas   2002   79.770   31902268  33328.965
Chile                Americas   2002   77.860   15497046  10778.784
Colombia             Americas   2002   71.682   41008227   5755.260
Costa Rica           Americas   2002   78.123    3834934   7723.447
Cuba                 Americas   2002   77.158   11226999   6340.647
Dominican Republic   Americas   2002   70.847    8650322   4563.808
Ecuador              Americas   2002   74.173   12921234   5773.045
El Salvador          Americas   2002   70.734    6353681   5351.569
Guatemala            Americas   2002   68.978   11178650   4858.347
Haiti                Americas   2002   58.137    7607651   1270.365
Honduras             Americas   2002   68.565    6677328   3099.729
Jamaica              Americas   2002   72.047    2664659   6994.775
Mexico               Americas   2002   74.902  102479927  10742.441
Nicaragua            Americas   2002   70.836    5146848   2474.549
Panama               Americas   2002   74.712    2990875   7356.032
Paraguay             Americas   2002   70.755    5884491   3783.674
Peru                 Americas   2002   69.906   26769436   5909.020
Puerto Rico          Americas   2002   77.778    3859606  18855.606
Trinidad and Tobago  Americas   2002   68.976    1101832  11460.600
United States        Americas   2002   77.310  287675526  39097.100
Uruguay              Americas   2002   75.307    3363085   7727.002
Venezuela            Americas   2002   72.766   24287670   8605.048
kable(read_back) %>%
  kable_styling("striped", full_width = F)

Part 3: Visualization design

Remake at least one figure or create a new one, in light of something you learned in the recent class meetings about visualization design and color. Maybe juxtapose your first attempt and what you obtained after some time spent working on it. Reflect on the differences. If using Gapminder, you can use the country or continent color scheme that ships with Gapminder. Consult the dimensions listed in All the Graph Things.

Then, make a new graph by converting this visual (or another, if you’d like) to a plotly graph. What are some things that plotly makes possible, that are not possible with a regular ggplot2 graph?

Here is a plot I created early in the semester:

ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  scale_x_log10() +
  geom_point(colour = "blue", alpha=0.2)

Let’s try to revamp this:

revamp <- gapminder %>% 
  ggplot(aes(gdpPercap, lifeExp)) +
  geom_point(aes(colour = pop), alpha = 0.2) +
  scale_x_log10() +
  scale_y_continuous(breaks = 10 * (1:10)) +
  scale_colour_distiller(
    trans   = "log10",
    breaks  = 10^(5:9),  # legend ticks inside the population range (about 6e4 to 1.3e9)
    palette = "Blues"    # ColorBrewer sequential palette
  ) +
  facet_wrap(~ continent) +
  theme_light() +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    x     = "GDP Per Capita",
    y     = "Life Expectancy"
  )
ggplotly(revamp)
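Compared with a static ggplot2 figure, a plotly graph lets the reader hover over a point to see its values, zoom and pan, and click legend entries to hide or show groups. A minimal sketch of a custom hover label, assuming the plotly package is installed and using the continent_colors palette that ships with gapminder; the names p and interactive, and the use of a text aesthetic to carry the country name, are illustrative choices rather than part of the plot above:

```r
library(gapminder)
library(ggplot2)
library(plotly)

# The text aesthetic is not drawn by ggplot2; ggplotly() can show it in the tooltip
p <- ggplot(gapminder,
            aes(gdpPercap, lifeExp, colour = continent, text = country)) +
  geom_point(alpha = 0.4) +
  scale_x_log10() +
  scale_colour_manual(values = continent_colors) +
  labs(x = "GDP Per Capita", y = "Life Expectancy")

# tooltip controls which aesthetics appear when hovering over a point
interactive <- ggplotly(p, tooltip = c("text", "x", "y"))
interactive
```

With the country name in the tooltip, individual outliers can be identified by hovering, which in a static plot would require labelling every point.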